## X age job marital
## Min. : 1 Min. :17.00 admin. :10422 divorced: 4612
## 1st Qu.:10298 1st Qu.:32.00 blue-collar: 9254 married :24928
## Median :20595 Median :38.00 technician : 6743 single :11568
## Mean :20595 Mean :40.02 services : 3969 unknown : 80
## 3rd Qu.:30891 3rd Qu.:47.00 management : 2924
## Max. :41188 Max. :98.00 retired : 1720
## (Other) : 6156
## education default housing
## university.degree :12168 no :32588 no :18622
## high.school : 9515 unknown: 8597 unknown: 990
## basic.9y : 6045 yes : 3 yes :21576
## professional.course: 5243
## basic.4y : 4176
## basic.6y : 2292
## (Other) : 1749
## loan contact month day_of_week
## no :33950 cellular :10505 may : 5536 fri : 3069
## unknown: 990 telephone: 5971 jul : 2842 mon : 3421
## yes : 6248 NA's :24712 aug : 2443 thu : 3491
## jun : 2126 tue : 3207
## nov : 1653 wed : 3288
## (Other): 1876 NA's:24712
## NA's :24712
## duration campaign pdays previous
## Min. : 0.0 Min. : 1.00 Min. : 0.0 Min. :0.000
## 1st Qu.: 104.0 1st Qu.: 1.00 1st Qu.:999.0 1st Qu.:0.000
## Median : 182.0 Median : 2.00 Median :999.0 Median :0.000
## Mean : 259.7 Mean : 2.58 Mean :962.5 Mean :0.173
## 3rd Qu.: 320.0 3rd Qu.: 3.00 3rd Qu.:999.0 3rd Qu.:0.000
## Max. :4918.0 Max. :43.00 Max. :999.0 Max. :7.000
## NA's :24712 NA's :24712
## poutcome emp.var.rate cons.price.idx cons.conf.idx
## failure : 4252 Min. :-3.40000 Min. :92.20 Min. :-50.8
## nonexistent:35563 1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.7
## success : 1373 Median : 1.10000 Median :93.75 Median :-41.8
## Mean : 0.08189 Mean :93.58 Mean :-40.5
## 3rd Qu.: 1.40000 3rd Qu.:93.99 3rd Qu.:-36.4
## Max. : 1.40000 Max. :94.77 Max. :-26.9
## NA's :113
## euribor3m nr.employed y test_control_flag
## Min. :0.634 Min. :4964 no :37048 campaign group:16476
## 1st Qu.:1.344 1st Qu.:5099 yes: 4140 control group :24712
## Median :4.857 Median :5191
## Mean :3.621 Mean :5167
## 3rd Qu.:4.961 3rd Qu.:5228
## Max. :5.045 Max. :5228
##
## contact month day_of_week duration
## cellular : 0 apr : 0 fri : 0 Min. : NA
## telephone: 0 aug : 0 mon : 0 1st Qu.: NA
## NA's :24712 dec : 0 thu : 0 Median : NA
## jul : 0 tue : 0 Mean :NaN
## jun : 0 wed : 0 3rd Qu.: NA
## (Other): 0 NA's:24712 Max. : NA
## NA's :24712 NA's :24712
## campaign
## Min. : NA
## 1st Qu.: NA
## Median : NA
## Mean :NaN
## 3rd Qu.: NA
## Max. : NA
## NA's :24712
aka qualitative analysis
## [1] "job"
## [1] "marital"
## [1] "education"
## [1] "default"
## [1] "housing"
## [1] "loan"
## [1] "contact"
## [1] "month"
## [1] "day_of_week"
## [1] "poutcome"
## [1] "job"
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 14828
##
##
## | campaign_train[, a]
## campaign_train[, "y"] | admin. | blue-collar | entrepreneur | housemaid | management | retired | self-employed | services | student | technician | unemployed | unknown | Row Total |
## ----------------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|
## no | 3234 | 3072 | 505 | 326 | 944 | 487 | 438 | 1344 | 221 | 2172 | 297 | 118 | 13158 |
## | 0.867 | 0.930 | 0.932 | 0.886 | 0.891 | 0.768 | 0.892 | 0.919 | 0.682 | 0.892 | 0.856 | 0.908 | |
## ----------------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|
## yes | 495 | 232 | 37 | 42 | 116 | 147 | 53 | 119 | 103 | 264 | 50 | 12 | 1670 |
## | 0.133 | 0.070 | 0.068 | 0.114 | 0.109 | 0.232 | 0.108 | 0.081 | 0.318 | 0.108 | 0.144 | 0.092 | |
## ----------------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|
## Column Total | 3729 | 3304 | 542 | 368 | 1060 | 634 | 491 | 1463 | 324 | 2436 | 347 | 130 | 14828 |
## | 0.251 | 0.223 | 0.037 | 0.025 | 0.071 | 0.043 | 0.033 | 0.099 | 0.022 | 0.164 | 0.023 | 0.009 | |
## ----------------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|
##
##
## [1] "marital"
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 14828
##
##
## | campaign_train[, a]
## campaign_train[, "y"] | divorced | married | single | unknown | Row Total |
## ----------------------|-----------|-----------|-----------|-----------|-----------|
## no | 1479 | 8055 | 3599 | 25 | 13158 |
## | 0.897 | 0.900 | 0.856 | 0.862 | |
## ----------------------|-----------|-----------|-----------|-----------|-----------|
## yes | 170 | 892 | 604 | 4 | 1670 |
## | 0.103 | 0.100 | 0.144 | 0.138 | |
## ----------------------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 1649 | 8947 | 4203 | 29 | 14828 |
## | 0.111 | 0.603 | 0.283 | 0.002 | |
## ----------------------|-----------|-----------|-----------|-----------|-----------|
##
##
## [1] "education"
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 14828
##
##
## | campaign_train[, a]
## campaign_train[, "y"] | basic.4y | basic.6y | basic.9y | high.school | illiterate | professional.course | university.degree | unknown | Row Total |
## ----------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|
## no | 1294 | 767 | 2011 | 3083 | 3 | 1708 | 3767 | 525 | 13158 |
## | 0.899 | 0.926 | 0.924 | 0.891 | 0.600 | 0.894 | 0.859 | 0.850 | |
## ----------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|
## yes | 146 | 61 | 166 | 379 | 2 | 203 | 620 | 93 | 1670 |
## | 0.101 | 0.074 | 0.076 | 0.109 | 0.400 | 0.106 | 0.141 | 0.150 | |
## ----------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|
## Column Total | 1440 | 828 | 2177 | 3462 | 5 | 1911 | 4387 | 618 | 14828 |
## | 0.097 | 0.056 | 0.147 | 0.233 | 0.000 | 0.129 | 0.296 | 0.042 | |
## ----------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|
##
##
## [1] "default"
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 14828
##
##
## | campaign_train[, a]
## campaign_train[, "y"] | no | unknown | yes | Row Total |
## ----------------------|-----------|-----------|-----------|-----------|
## no | 10225 | 2932 | 1 | 13158 |
## | 0.871 | 0.950 | 1.000 | |
## ----------------------|-----------|-----------|-----------|-----------|
## yes | 1515 | 155 | 0 | 1670 |
## | 0.129 | 0.050 | 0.000 | |
## ----------------------|-----------|-----------|-----------|-----------|
## Column Total | 11740 | 3087 | 1 | 14828 |
## | 0.792 | 0.208 | 0.000 | |
## ----------------------|-----------|-----------|-----------|-----------|
##
##
## [1] "housing"
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 14828
##
##
## | campaign_train[, a]
## campaign_train[, "y"] | no | unknown | yes | Row Total |
## ----------------------|-----------|-----------|-----------|-----------|
## no | 5952 | 342 | 6864 | 13158 |
## | 0.889 | 0.905 | 0.885 | |
## ----------------------|-----------|-----------|-----------|-----------|
## yes | 745 | 36 | 889 | 1670 |
## | 0.111 | 0.095 | 0.115 | |
## ----------------------|-----------|-----------|-----------|-----------|
## Column Total | 6697 | 378 | 7753 | 14828 |
## | 0.452 | 0.025 | 0.523 | |
## ----------------------|-----------|-----------|-----------|-----------|
##
##
## [1] "loan"
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 14828
##
##
## | campaign_train[, a]
## campaign_train[, "y"] | no | unknown | yes | Row Total |
## ----------------------|-----------|-----------|-----------|-----------|
## no | 10820 | 342 | 1996 | 13158 |
## | 0.885 | 0.905 | 0.897 | |
## ----------------------|-----------|-----------|-----------|-----------|
## yes | 1404 | 36 | 230 | 1670 |
## | 0.115 | 0.095 | 0.103 | |
## ----------------------|-----------|-----------|-----------|-----------|
## Column Total | 12224 | 378 | 2226 | 14828 |
## | 0.824 | 0.025 | 0.150 | |
## ----------------------|-----------|-----------|-----------|-----------|
##
##
## [1] "contact"
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 14828
##
##
## | campaign_train[, a]
## campaign_train[, "y"] | cellular | telephone | Row Total |
## ----------------------|-----------|-----------|-----------|
## no | 8068 | 5090 | 13158 |
## | 0.853 | 0.948 | |
## ----------------------|-----------|-----------|-----------|
## yes | 1392 | 278 | 1670 |
## | 0.147 | 0.052 | |
## ----------------------|-----------|-----------|-----------|
## Column Total | 9460 | 5368 | 14828 |
## | 0.638 | 0.362 | |
## ----------------------|-----------|-----------|-----------|
##
##
## [1] "month"
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 14828
##
##
## | campaign_train[, a]
## campaign_train[, "y"] | apr | aug | dec | jul | jun | mar | may | nov | oct | sep | Row Total |
## ----------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## no | 792 | 1924 | 38 | 2335 | 1709 | 89 | 4687 | 1333 | 130 | 121 | 13158 |
## | 0.798 | 0.883 | 0.567 | 0.917 | 0.894 | 0.486 | 0.934 | 0.899 | 0.546 | 0.573 | |
## ----------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## yes | 201 | 255 | 29 | 210 | 202 | 94 | 331 | 150 | 108 | 90 | 1670 |
## | 0.202 | 0.117 | 0.433 | 0.083 | 0.106 | 0.514 | 0.066 | 0.101 | 0.454 | 0.427 | |
## ----------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 993 | 2179 | 67 | 2545 | 1911 | 183 | 5018 | 1483 | 238 | 211 | 14828 |
## | 0.067 | 0.147 | 0.005 | 0.172 | 0.129 | 0.012 | 0.338 | 0.100 | 0.016 | 0.014 | |
## ----------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
## [1] "day_of_week"
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 14828
##
##
## | campaign_train[, a]
## campaign_train[, "y"] | fri | mon | thu | tue | wed | Row Total |
## ----------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## no | 2442 | 2792 | 2762 | 2545 | 2617 | 13158 |
## | 0.895 | 0.908 | 0.872 | 0.880 | 0.882 | |
## ----------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## yes | 287 | 282 | 404 | 348 | 349 | 1670 |
## | 0.105 | 0.092 | 0.128 | 0.120 | 0.118 | |
## ----------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 2729 | 3074 | 3166 | 2893 | 2966 | 14828 |
## | 0.184 | 0.207 | 0.214 | 0.195 | 0.200 | |
## ----------------------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
## [1] "poutcome"
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 14828
##
##
## | campaign_train[, a]
## campaign_train[, "y"] | failure | nonexistent | success | Row Total |
## ----------------------|-------------|-------------|-------------|-------------|
## no | 1317 | 11677 | 164 | 13158 |
## | 0.863 | 0.912 | 0.331 | |
## ----------------------|-------------|-------------|-------------|-------------|
## yes | 209 | 1129 | 332 | 1670 |
## | 0.137 | 0.088 | 0.669 | |
## ----------------------|-------------|-------------|-------------|-------------|
## Column Total | 1526 | 12806 | 496 | 14828 |
## | 0.103 | 0.864 | 0.033 | |
## ----------------------|-------------|-------------|-------------|-------------|
##
##
## [1] "job against marital"
## [1] "job against education"
## [1] "job against default"
## [1] "job against housing"
## [1] "job against loan"
## [1] "job against contact"
## [1] "job against month"
## [1] "job against day_of_week"
## [1] "job against poutcome"
## [1] "marital against job"
## [1] "marital against education"
## [1] "marital against default"
## [1] "marital against housing"
## [1] "marital against loan"
## [1] "marital against contact"
## [1] "marital against month"
## [1] "marital against day_of_week"
## [1] "marital against poutcome"
## [1] "education against job"
## [1] "education against marital"
## [1] "education against default"
## [1] "education against housing"
## [1] "education against loan"
## [1] "education against contact"
## [1] "education against month"
## [1] "education against day_of_week"
## [1] "education against poutcome"
## [1] "default against job"
## [1] "default against marital"
## [1] "default against education"
## [1] "default against housing"
## [1] "default against loan"
## [1] "default against contact"
## [1] "default against month"
## [1] "default against day_of_week"
## [1] "default against poutcome"
## [1] "housing against job"
## [1] "housing against marital"
## [1] "housing against education"
## [1] "housing against default"
## [1] "housing against loan"
## [1] "housing against contact"
## [1] "housing against month"
## [1] "housing against day_of_week"
## [1] "housing against poutcome"
## [1] "loan against job"
## [1] "loan against marital"
## [1] "loan against education"
## [1] "loan against default"
## [1] "loan against housing"
## [1] "loan against contact"
## [1] "loan against month"
## [1] "loan against day_of_week"
## [1] "loan against poutcome"
## [1] "contact against job"
## [1] "contact against marital"
## [1] "contact against education"
## [1] "contact against default"
## [1] "contact against housing"
## [1] "contact against loan"
## [1] "contact against month"
## [1] "contact against day_of_week"
## [1] "contact against poutcome"
## [1] "month against job"
## [1] "month against marital"
## [1] "month against education"
## [1] "month against default"
## [1] "month against housing"
## [1] "month against loan"
## [1] "month against contact"
## [1] "month against day_of_week"
## [1] "month against poutcome"
## [1] "day_of_week against job"
## [1] "day_of_week against marital"
## [1] "day_of_week against education"
## [1] "day_of_week against default"
## [1] "day_of_week against housing"
## [1] "day_of_week against loan"
## [1] "day_of_week against contact"
## [1] "day_of_week against month"
## [1] "day_of_week against poutcome"
## [1] "poutcome against job"
## [1] "poutcome against marital"
## [1] "poutcome against education"
## [1] "poutcome against default"
## [1] "poutcome against housing"
## [1] "poutcome against loan"
## [1] "poutcome against contact"
## [1] "poutcome against month"
## [1] "poutcome against day_of_week"
Numeric variables
## [1] "euribor3m"
## [1] "nr.employed"
## [1] "cons.price.idx"
## [1] "cons.conf.idx"
## [1] "previous"
## [1] "pdays"
## [1] "campaign"
## [1] "duration"
## [1] "age"
Factor variables
“poutcome” - stays in
“day_of_week”
Continuous variables
0, 1, 2+
“age” - stays in
Contact info: dropping all of these, because they are unknown at prediction time in our use case:
“campaign”, “duration”
Goal: a binary classification model (yes/no). The qualitative analysis also shows high class imbalance: 1670 “yes” against 13158 “no” in the training sample, about 11% positive. Therefore accuracy should NOT be used as the metric. Area under the ROC curve will be used for model selection.
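To illustrate why accuracy is misleading here, the sketch below computes the accuracy of a trivial "always no" classifier from the class counts in the cross-tables above, and then shows one common way to compute the area under the ROC curve (the rank-based Mann-Whitney formulation). The helper `auc_rank` and the toy scores are assumptions made for this example and are not part of the original analysis.

```r
# Class counts taken from the training cross-tables above.
n_no  <- 13158
n_yes <- 1670

# A model that always predicts "no" already reaches high accuracy
# while identifying no buyers at all.
baseline_accuracy <- n_no / (n_no + n_yes)
round(baseline_accuracy, 3)

# AUC via the rank (Mann-Whitney) formulation: the probability that a randomly
# chosen "yes" row receives a higher score than a randomly chosen "no" row.
# `auc_rank` is a hypothetical helper written for this sketch.
auc_rank <- function(score, label, positive = "yes") {
  is_pos <- label == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  r <- rank(score)  # average ranks handle ties
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy scores (hypothetical, not from the data set).
label <- c("no", "no", "yes", "no", "yes", "yes")
score <- c(0.10, 0.40, 0.35, 0.20, 0.80, 0.70)
auc_toy <- auc_rank(score, label)
auc_toy

# A constant score carries no ranking information and gives AUC = 0.5.
auc_constant <- auc_rank(rep(0.5, length(label)), label)
```

The constant-score case is the point of the comparison: a model with no ranking ability scores 0.5 on AUC, while the same model can still reach roughly 0.89 accuracy on this sample simply by predicting the majority class.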
Approach: 1. train with 5-fold cross-validation on the training sample; 2. choose the model using the testing sample.
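The cross-validation step can be sketched in base R as follows. This is a minimal illustration under stated assumptions: the data are synthetic, a logistic regression (`glm`) stands in for the models reported below, and the names `train_df`, `fold` and `auc_rank` are hypothetical. The printed results in this report look like output from a modelling framework, which this sketch does not try to reproduce.

```r
set.seed(42)

# Synthetic stand-in for the training sample (hypothetical data).
n <- 500
x <- rnorm(n)
y <- factor(ifelse(runif(n) < plogis(-2 + 1.5 * x), "yes", "no"),
            levels = c("no", "yes"))
train_df <- data.frame(x = x, y = y)

# Stratified fold assignment: each class is spread evenly over the k folds,
# which matters when the positive class is rare.
k <- 5
fold <- integer(n)
for (cls in levels(train_df$y)) {
  idx <- which(train_df$y == cls)
  fold[idx] <- sample(rep_len(seq_len(k), length(idx)))
}

# Rank-based AUC (hypothetical helper).
auc_rank <- function(score, is_pos) {
  r <- rank(score)
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Fit on k-1 folds, score the held-out fold, and collect the AUC per fold.
cv_auc <- sapply(seq_len(k), function(i) {
  fit <- glm(y ~ x, data = train_df[fold != i, ], family = binomial)
  held_out <- train_df[fold == i, ]
  p <- predict(fit, newdata = held_out, type = "response")
  auc_rank(p, held_out$y == "yes")
})
mean(cv_auc)
```

Averaging the per-fold AUC gives the cross-validated estimate used to pick tuning parameters; the testing sample is kept aside for the final comparison between models.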
## CART
##
## 14822 samples
## 14 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times)
## Summary of sample sizes: 11858, 11857, 11858, 11858, 11857, 11857, ...
## Resampling results across tuning parameters:
##
## cp ROC Sens Spec
## 0.004196643 0.7053988 0.9896152 0.1900394
## 0.005095923 0.7050815 0.9907251 0.1792527
## 0.055455635 0.6193921 0.9945416 0.0978400
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.004196643.
SVM (radial)
## Support Vector Machines with Radial Basis Function Kernel
##
## 14820 samples
## 14 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times)
## Summary of sample sizes: 11857, 11856, 11856, 11855, 11856, 11856, ...
## Resampling results across tuning parameters:
##
## C ROC Sens Spec
## 0.25 0.6458148 0.9960003 0.02482063
## 0.50 0.6464443 0.9916052 0.05910162
## 1.00 0.6386480 0.9917727 0.05177789
##
## Tuning parameter 'sigma' was held constant at a value of 6.095125e-07
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 6.095125e-07 and C = 0.5.
## Random Forest
##
## 14820 samples
## 14 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 11855, 11857, 11856, 11857, 11855
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 2 0.7742432 0.9916362 0.1588678
## 17 0.7716569 0.9655567 0.2967524
## 33 0.7673464 0.9649483 0.2889572
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## Stochastic Gradient Boosting
##
## 14820 samples
## 14 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times)
## Summary of sample sizes: 11856, 11856, 11855, 11857, 11856, 11855, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees ROC Sens Spec
## 1 50 0.7812019 0.9897810 0.1875284
## 1 100 0.7879244 0.9880019 0.2043114
## 1 150 0.7883184 0.9871654 0.2115060
## 2 50 0.7875218 0.9879106 0.2055137
## 2 100 0.7898940 0.9862074 0.2244520
## 2 150 0.7900628 0.9856904 0.2322472
## 3 50 0.7885373 0.9870438 0.2184532
## 3 100 0.7907410 0.9855079 0.2350031
## 3 150 0.7910849 0.9843674 0.2371660
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
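Random forest shows the highest area under the curve on the testing sample (in the cross-validation results above, gradient boosting scores slightly higher: 0.791 against 0.774). Choosing random forest for the uplift calculation.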
Uplift = increase in probability of “buying” due to the campaign action.
Approach: 1. use the model trained on the campaign set, so that its predicted probabilities assume the subject (row) was exposed to the campaign; 2. predict probabilities for the control set with the random forest model to simulate applying the campaign to the control set.
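The two steps can be sketched as below. This is an illustration only: the data frames `campaign_df` and `control_df` are synthetic, `age` is the only predictor, and a logistic regression is swapped in for the random forest so that the example runs with base R alone.

```r
set.seed(1)

# Hypothetical stand-ins for the two groups.
campaign_df <- data.frame(age = round(runif(400, 18, 90)))
campaign_df$y <- factor(
  ifelse(runif(400) < plogis(-3 + 0.03 * campaign_df$age), "yes", "no"),
  levels = c("no", "yes"))
control_df <- data.frame(age = round(runif(600, 18, 90)))

# Step 1: the model is fitted on campaign rows only, so its predictions
# are conditional on the subject having been contacted.
fit <- glm(y ~ age, data = campaign_df, family = binomial)

# Step 2: score the control rows to simulate exposing them to the campaign.
control_df$p_campaign <- predict(fit, newdata = control_df, type = "response")
summary(control_df$p_campaign)
```

The column `p_campaign` then holds, for every control row, the estimated probability of "yes" had that row been contacted.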
Rows were removed from the campaign set to match the training sample structure. In a real-life application the prediction for these rows would be “0” (or “no”) to avoid an over-optimistic uplift estimate; they are omitted here for simplicity.
The base probability is calculated as the simple ratio of “yes” outcomes in the control group.
The calculated average base probability of ‘term deposit’ is 0.0928171.
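As a small worked example of this ratio, assuming a hypothetical vector `control_y` of control-group outcomes (not the real data):

```r
# Hypothetical control-group outcomes: 2 "yes" out of 10 rows.
control_y <- factor(c("no", "no", "yes", "no", "no",
                      "no", "no", "yes", "no", "no"),
                    levels = c("no", "yes"))

# Base probability = share of "yes" rows in the control group.
base_prob <- mean(control_y == "yes")
base_prob

# The same figure read from a proportion table.
prop.table(table(control_y))[["yes"]]
```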
We take 16476 rows to match the number of calls made in the reference campaign. This represents applying the same campaign cost to a new campaign. In other words, we choose the 16476 most promising future customers/buyers.
Obtained uplift = 0.0885294, calculated as the mean of per-row differences between the predicted probability of “term deposit” and the base probability, over the 16476 rows with the highest predicted probabilities.
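The calculation can be illustrated with a few made-up numbers. The values of `p_campaign`, `base_prob` and `top_n` below are hypothetical and chosen only to keep the arithmetic easy to follow; the report itself uses the 16476 highest predicted probabilities and the base probability of the control group.

```r
# Hypothetical predicted probabilities for control rows under the campaign.
p_campaign <- c(0.05, 0.40, 0.10, 0.25, 0.02, 0.30)
base_prob  <- 0.09  # hypothetical base rate of "yes" in the control group
top_n      <- 3     # hypothetical; stands in for the number of calls available

# Keep the top_n highest predicted probabilities.
top_p <- sort(p_campaign, decreasing = TRUE)[seq_len(top_n)]

# Uplift: mean per-row difference between predicted and base probability.
uplift <- mean(top_p - base_prob)
uplift
```

Because the base probability is a single constant, the mean of the per-row differences equals the mean predicted probability of the selected rows minus the base probability.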
These are PROPOSED next steps to improve this analysis